Summary

The law of likelihood underlies a general framework, known as the likelihood paradigm, for representing
and interpreting statistical evidence. As stated, the law applies only to simple hypotheses, and there have
been reservations about extending the law to composite hypotheses, despite their tremendous relevance in
statistical applications. This paper proposes a generalization of the law of likelihood for composite
hypotheses. The generalized law is developed in an axiomatic fashion, illustrated with real examples, and
examined in an asymptotic analysis. Previous concerns about including composite hypotheses in the likelihood
paradigm are discussed in light of the new developments. The generalized law of likelihood is compared with
other likelihood-based methods and its practical implications are noted. Lastly, a discussion is given on how
to use the generalized law to interpret published results of hypothesis tests as reduced data when the full
data are not available.

1 Introduction

A major part of statistics is to interpret observed data as statistical evidence, often in response to questions
like "what do the data say?". Yet there is no consensus among statisticians on what constitutes statistical
evidence and how to measure its strength. Hypothesis tests and posterior probabilities are commonly used
to interpret and communicate statistical evidence with regard to two competing hypotheses. The
Neyman-Pearson theory for testing hypotheses was developed under a decision-theoretic framework that attempts
to answer such questions as "what should I do?". The theory can lead to serious logical inconsistencies when
used to address the problem of representing and interpreting statistical evidence (Royall, 1997, Chapter 2).
The Bayesian approach, on the other hand, is more appropriate for questions of the form "what should I
believe?". A posterior distribution, which incorporates prior information as well as the data, may not provide
an objective representation of evidence in the observed data alone. A proper concept of evidence is missing
from standard statistical theories.



The missing concept of evidence can be found in what Hacking (1965) termed the law of likelihood (LL):

If one hypothesis, $H_1$, implies that a random variable $X$ takes the value $x$ with probability $f_1(x)$,
while another hypothesis, $H_2$, implies that the probability is $f_2(x)$, then the observation $X = x$ is
evidence supporting $H_1$ over $H_2$ if $f_1(x) > f_2(x)$, and the likelihood ratio, $f_1(x)/f_2(x)$, measures
the strength of that evidence.



This point of view has led to a likelihood paradigm for interpreting statistical evidence, which carefully
distinguishes evidence from error probabilities and personal belief (Royall, 1994, 1997; Blume, 2002).
Royall (1997) also proposes benchmarks for the strength of statistical evidence. Specifically, a likelihood
ratio (LR) exceeding $k = 8$ is considered fairly strong evidence, while $k = 32$ is used to define strong
evidence. Royall (2000) analyzes the probability of observing misleading evidence under parametric models, and
Blume (2008) provides a parallel analysis for sequential trials. Royall and Tsou (2003) and Blume et al. (2007)
develop adjusted likelihood functions with certain robustness properties under model failure. Zhang (2008)
advocates the use of empirical likelihood functions in nonparametric and semiparametric situations.

Most of the discussion so far about the likelihood paradigm has been limited to simple hypotheses. This 
is in sharp contrast to the tremendous relevance of composite hypotheses in statistical applications. In 
confirmatory clinical trials, for example, the primary objective is often to demonstrate that a new treatment 
is superior to a placebo/sham control, or not inferior to the standard of care by more than a specified amount. 
Also, there are bioequivalence trials designed to show that a generic drug is similar to a brand drug with
respect to pharmacokinetic parameters. While Royall (1997, 2000) has considered some very special types
of composite hypotheses (to be described in Section 2), he and Blume (2002) have not encouraged efforts to
further extend the LL to general composite hypotheses, due to concerns that will be addressed in Section 5.
Blume (2002) argues that a graph of the likelihood function suffices for an evidential analysis and "no further
reduction or summarization of the evidence is necessary". However, in many situations including the clinical
studies mentioned earlier, it is important to not only look at the graph but also decide explicitly whether
there is evidence supporting one specific hypothesis over another and, if so, how strong that evidence is.
The LL makes this possible for simple hypotheses, and a reasonable generalization of the law for composite
hypotheses will certainly be helpful. He et al. (2007) consider this problem in a finite parameter space
without suggesting a practical solution. This article proposes a solution for general composite hypotheses.

The proposed solution will be derived in an axiomatic fashion in the next section, illustrated with real
examples in Section 3, and analyzed asymptotically in Section 4. Section 5 discusses previous concerns about
including composite hypotheses in the likelihood paradigm. Section 6 compares the proposed method with
other likelihood-based methods. Lastly, a discussion is given in Section 7 on how to interpret published
results of hypothesis tests as reduced data when the full data are not available.



2 Generalizing the law of likelihood 

Let $X$ represent the data and suppose $X$ follows a distribution with density $f(\cdot; \theta)$, which is
known up to a parameter (vector) $\theta$ taking values in $\Theta$. Then the likelihood for $\theta$, based on
the observation $X = x$, is given by $L(\theta) = f(x; \theta)$. According to the LL, the data provide evidence
supporting one parameter value $\theta_1$ over another value $\theta_2$ if $L(\theta_1) > L(\theta_2)$, and the
strength of that evidence is measured by the LR $L(\theta_1)/L(\theta_2)$. For this purpose, it is irrelevant
whether the values $\theta_1$ and $\theta_2$ are predetermined or data-driven.

What do the data say about general hypotheses of the form $H_1 : \theta \in \Theta_1 \subset \Theta$ versus
$H_2 : \theta \in \Theta_2 \subset \Theta$? Only some special cases have been considered. Obviously, if the
likelihood $L$ is constant over each hypothesis with respective values $L_1$ and $L_2$, then the two hypotheses
should be compared on the basis of the LR $L_1/L_2$ (Royall, 1997, Sections 1.7, 1.8). It has also been
suggested that if the images $L(\Theta_1)$ and $L(\Theta_2)$ are intervals that do not overlap, then the
evidence supports the hypothesis with larger likelihood values over the other one, with little discussion on
how to measure the strength of that evidence (Royall, 2000; He et al., 2007). We shall take as an axiom a
slight generalization of the latter suggestion.

Axiom 1. If $\inf L(\Theta_1) > \sup L(\Theta_2)$, then there is evidence supporting $H_1$ over $H_2$.

Also, it seems reasonable to expect evidential interpretations to be logically consistent, in the following
sense.

Axiom 2. If there is evidence supporting $H_1^*$ over $H_2$ and $H_1^*$ implies $H_1$, then the evidence also
supports $H_1$ over $H_2$.

These axioms together suggest the following generalization of the LL.

Theorem 1. Let Axioms 1 and 2 hold. If $\sup L(\Theta_1) > \sup L(\Theta_2)$, then there is evidence
supporting $H_1$ over $H_2$.

Proof. Let $\sup L(\Theta_1) > \sup L(\Theta_2)$; then $\Theta_1$ is not empty and there exists
$\theta_1 \in \Theta_1$ such that $L(\theta_1) > \sup L(\Theta_2)$. It follows from Axiom 1 that the simple
hypothesis $H_1^* : \theta = \theta_1$ is supported over $H_2$. The result now follows directly from Axiom 2. □

It also seems natural to use the generalized likelihood ratio (GLR) $\sup L(\Theta_1)/\sup L(\Theta_2)$ to
measure the strength of the evidence, although this does not follow from the axioms. This generalized law of
likelihood (GLL) is consistent with the original law for simple hypotheses, and with previous suggestions for
special cases of composite hypotheses (Royall, 1997, 2000). Moreover, the GLL coincides with the profile
likelihood approach in the presence of nuisance parameters. Suppose $\theta = (\gamma, \omega)$, where $\gamma$
is of primary interest and $\omega$ is a nuisance parameter. While there may be ad hoc solutions available
depending on the specific problem, a viable general approach is to represent the evidence about $\gamma$ with
the profile likelihood $\bar{L}(\gamma) = \sup_{\omega} L(\gamma, \omega)$, which has good properties that
justify its use (Royall, 2000, Section 5). The GLL, treating a "simple" hypothesis about $\gamma$ as a
composite one about $(\gamma, \omega)$, would lead to the same answer for comparing two values of $\gamma$. Of
course, unlike the profile likelihood approach, the GLL also applies to arbitrary composite hypotheses
concerning $(\gamma, \omega)$.

As a by-product, the GLL allows one to assess the "absolute" evidence about a single parameter value or
a single hypothesis, using its complement as the default comparator. This would not be possible without a
mechanism to deal with composite hypotheses (Royall, 1997, 2000). Under the GLL, a hypothesis
$H_1 : \theta \in \Theta_1$ is supported by the data if $\sup L(\Theta_1) > \sup L(\Theta_1^c)$, where the
superscript $c$ denotes complement, and the strength of that evidence is measured by the GLR
$\sup L(\Theta_1)/\sup L(\Theta_1^c)$. In particular, the evidence about a single parameter value, say
$\theta_1$, is represented by the ratio $L(\theta_1)/\sup_{\theta \ne \theta_1} L(\theta)$. As we shall see in
the examples to follow, excluding a single parameter value does not usually alter the supremum of the
likelihood, which means $L(\theta_1)/\sup_{\theta \ne \theta_1} L(\theta) = L(\theta_1)/\sup L(\Theta) \le 1$.
Thus it is usually impossible to obtain empirical evidence supporting a single parameter value in a smooth
model over its complement.



3 Real examples 



The GLL will now be illustrated with some real examples. The first example, taken from Royall (1997,
Section 1.9), is a clinical study in which 17 subjects were enrolled and given a new treatment. The outcome
of interest was a binary indicator of treatment success, for which a binomial model is assumed. The success
probability for the new treatment, denoted by $\theta$, was to be compared with the same probability for a
standard treatment which was believed to be about 0.2. This gave rise to two composite hypotheses of
interest: $H_1 : \theta \le 0.2$ versus $H_2 : \theta > 0.2$. At the end of the study, nine subjects were found
to have achieved treatment success and the resulting likelihood for $\theta$ is plotted in Figure 1. (In this
and the subsequent likelihood plots, each likelihood function is divided by its maximum value so the peak
value is invariably 1.) Royall (1997) recognizes that the LL does not allow us to compare $H_1$ and $H_2$
directly, and suggests that we should instead make pairwise comparisons between selected parameter values in
$H_1$ and values in $H_2$. However, the choice of parameter values for pairwise comparisons can be rather
subjective, and even after a large number of pairwise comparisons it may remain unclear how to answer the
original question: Do the data support $H_1$ or $H_2$ and by how much? An application of the GLL yields a
direct, unambiguous and objective answer: $H_2$ is supported over $H_1$ with a GLR of 91, which indicates
strong evidence.
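
The GLR in this example can be reproduced numerically. The sketch below is illustrative only: it assumes the
binomial likelihood $L(\theta) \propto \theta^9 (1 - \theta)^8$ implied by the text and relies on the fact that
this likelihood is unimodal, so that each supremum is attained (or approached) either at the maximum likelihood
estimate or at the boundary value 0.2. The function names are hypothetical and not part of the original analysis.

```python
import math


def binom_loglik(theta, successes, n):
    """Binomial log-likelihood, dropping the additive constant."""
    return successes * math.log(theta) + (n - successes) * math.log(1.0 - theta)


def glr_threshold(successes, n, cutoff):
    """GLR of H2: theta > cutoff over H1: theta <= cutoff.

    Assumes 0 < successes < n, so the likelihood is unimodal with
    its peak at the MLE successes / n.
    """
    mle = successes / n
    # Supremum over (cutoff, 1]: the MLE if it lies there, else the boundary.
    sup_h2 = binom_loglik(max(mle, cutoff), successes, n)
    # Supremum over [0, cutoff]: the MLE if it lies there, else the boundary.
    sup_h1 = binom_loglik(min(mle, cutoff), successes, n)
    return math.exp(sup_h2 - sup_h1)


glr = glr_threshold(successes=9, n=17, cutoff=0.2)
print(round(glr, 1))
```

Under these assumptions the computed ratio is roughly 91, in line with the value quoted above.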

[Figure 1 about here.]

The second example is a randomized clinical study first reported by Rodary et al. (1989) and later
discussed by many authors. The trial enrolled 164 children with nephroblastoma, who were randomly
assigned to either chemotherapy or radiation therapy. The primary objective of the trial was to demonstrate
that chemotherapy is non-inferior to radiation therapy with respect to the response rate. More precisely,
non-inferiority here means that the response rate for chemotherapy is not lower than that for radiation
therapy by more than a margin of 10%, which is considered the smallest clinically meaningful difference
between two groups. The observed response rates were 94.3% (83/88) for chemotherapy and 90.8% (69/76)
for radiation therapy. The profile likelihood function for the difference between the response rates in the
two groups is computed using a bisection method as in Zhang (2006) and plotted in Figure 2. Under the
GLL, the non-inferiority hypothesis is strongly supported with a GLR of 138. In fact, with a higher observed
response rate in the chemotherapy group, there is even evidence supporting the superiority of chemotherapy
to radiation therapy. This latter piece of evidence is rather weak, though, with a GLR of 1.4.
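
The two GLRs in this example can be approximated with a short script. The following sketch is not the
implementation of Zhang (2006); it is a simple stand-in that assumes independent binomial responses in the
two arms and profiles out the radiation response rate by bisection on the score function. All function names
are hypothetical.

```python
import math


def loglik(p_c, p_r, x_c=83, n_c=88, x_r=69, n_r=76):
    """Two-sample binomial log-likelihood (additive constants dropped)."""
    return (x_c * math.log(p_c) + (n_c - x_c) * math.log(1.0 - p_c)
            + x_r * math.log(p_r) + (n_r - x_r) * math.log(1.0 - p_r))


def profile_loglik(delta, x_c=83, n_c=88, x_r=69, n_r=76, tol=1e-12):
    """Profile log-likelihood of delta = p_c - p_r.

    The nuisance parameter p_r is maximized out by bisection on the
    score, which is strictly decreasing in p_r when all four cell
    counts are positive.
    """
    def score(p_r):
        p_c = p_r + delta
        return (x_c / p_c - (n_c - x_c) / (1.0 - p_c)
                + x_r / p_r - (n_r - x_r) / (1.0 - p_r))

    lo = max(0.0, -delta)       # keeps p_r > 0 and p_c > 0
    hi = min(1.0, 1.0 - delta)  # keeps p_r < 1 and p_c < 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    p_r = 0.5 * (lo + hi)
    return loglik(p_r + delta, p_r, x_c, n_c, x_r, n_r)


# Unrestricted maximum, attained at the observed response rates.
max_loglik = loglik(83 / 88, 69 / 76)

# Non-inferiority (delta > -0.10) versus its complement (delta <= -0.10).
glr_noninferiority = math.exp(max_loglik - profile_loglik(-0.10))

# Superiority (delta > 0) versus its complement (delta <= 0).
glr_superiority = math.exp(max_loglik - profile_loglik(0.0))

print(round(glr_noninferiority), round(glr_superiority, 2))
```

Because the log-likelihood is concave in the two response rates, the profile log-likelihood is concave in the
difference, so the supremum over each one-sided hypothesis is attained at the unrestricted maximum or
approached at the boundary (-0.10 for non-inferiority, 0 for superiority). The two ratios should come out near
138 and between 1.4 and 1.5, respectively.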
[Figure 2 about here.]



Finally, let us consider a bioequivalence trial described in Wellek (2003, Chapter 9). Bioequivalence trials
are conducted to show that a generic drug or new formulation (test) is nearly equivalent in bioavailability to
an approved brand-name drug or formulation (reference). There are different ways to measure bioavailability,
but this example is primarily concerned with the area under the curve (AUC) for the serum concentration
of the drug changing over time. The trial involved 25 patients and followed a crossover design where each
patient was randomly assigned to a treatment sequence (test followed by reference or the opposite, with equal
probabilities). Let $(Y_T, Y_R)$ denote the log-transformed AUC measurements in the test and reference periods,
respectively, on the same subject. Following Choi et al. (2008), we assume that there are no sequence or
period effects and that the measurements follow a simple bivariate normal model:




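$$
\begin{pmatrix} Y_T \\ Y_R \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_T \\ \mu_R \end{pmatrix},
\begin{pmatrix} \sigma_T^2 & \rho \sigma_T \sigma_R \\ \rho \sigma_T \sigma_R & \sigma_R^2 \end{pmatrix} \right).
$$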



In this setting, it is natural to assess bioequivalence by comparing $\mu_T$ with $\mu_R$ and $\sigma_T$ with
$\sigma_R$. Figures 3 and 4 display the profile likelihood functions for $\mu_T - \mu_R$ and
$\sigma_T/\sigma_R$, respectively, both based on formulas given by Choi et al. (2008, Appendix A). In terms of
the means, bioequivalence is usually defined as $|\mu_T - \mu_R| < 0.223$, which corresponds to
$0.8 < \exp(\mu_T)/\exp(\mu_R) < 1.25$. As is clear in Figure 3, this bioequivalence hypothesis enjoys
overwhelming support by the observed data, with a GLR greater than $10^6$. There is not a general definition of
bioequivalence in terms of the standard deviations. One might, however, follow the same reasoning about the
exponentiated means and consider the standard deviations close enough if $0.8 < \sigma_T/\sigma_R < 1.25$. The
evidence regarding this latter hypothesis is largely neutral, with a GLR of 1.1.

[Figure 3 about here.]

[Figure 4 about here.]



4 Large-sample theory 

In this section we consider the behavior of the GLL in large samples. Specifically, let
$X = (Y_1, \ldots, Y_n)$, a collection of independent copies of some random variable $Y$, and let the density
of $Y$ be modeled as $g(\cdot; \theta)$, $\theta \in \Theta$. Then the likelihood for $\theta$, based on the
observations $Y_i = y_i$, $i = 1, \ldots, n$, is given by

$$
L_n(\theta) = \prod_{i=1}^n g(y_i; \theta).
$$

It is instructive to begin with the simple case of two simple hypotheses: $H_1 : \theta = \theta_1$ versus
$H_2 : \theta = \theta_2$. The law of large numbers implies that, with probability 1,

$$
l_n(\theta) := \frac{1}{n} \log L_n(\theta) = \frac{1}{n} \sum_{i=1}^n \log g(Y_i; \theta)
\to E \log g(Y; \theta) =: l_\infty(\theta).
$$

It follows that $L_n(\theta_1)/L_n(\theta_2)$ tends to $\infty$ if $l_\infty(\theta_1) > l_\infty(\theta_2)$
and to 0 if $l_\infty(\theta_1) < l_\infty(\theta_2)$. Thus the LL essentially orders the values in $\Theta$
according to $l_\infty$. If $l_\infty$ has a unique maximum at $\theta_0$, then $\theta_0$ will eventually
dominate any other fixed value in $\Theta$. Obviously, $\theta_0$ is just the true value of $\theta$ if the
model $g(\cdot; \theta)$ is correctly specified and suitably identified. Under an incorrect model, $\theta_0$
may be considered the "object of inference" (Royall and Tsou, 2003).



Under the GLL, the subsets of $\Theta$ are ordered according to the suprema of their images under $l_\infty$.
This can be formalized as follows.

Theorem 2. Let $\Theta_1, \Theta_2$ be subsets of $\Theta$. If the collection of functions
$\{\log g(Y; \theta) : \theta \in \Theta_j\}$ is Glivenko-Cantelli for $j = 1, 2$, then, with probability 1,

$$
\frac{\sup L_n(\Theta_1)}{\sup L_n(\Theta_2)} \to
\begin{cases}
\infty & \text{if } \sup l_\infty(\Theta_1) > \sup l_\infty(\Theta_2), \\
0 & \text{if } \sup l_\infty(\Theta_1) < \sup l_\infty(\Theta_2).
\end{cases}
$$

This follows directly from the uniform law of large numbers. An extensive discussion of the
Glivenko-Cantelli property, including techniques for its verification, can be found in van der Vaart and
Wellner (1996). Certainly, any parameter set that contains $\theta_0$ (if it exists) attains the global maximum
of $l_\infty$. On the other hand, a set that does not contain $\theta_0$ may or may not attain the global
maximum, depending on certain properties of $l_\infty$. Some general characterizations are given below.

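Lemma 1. Suppose $l_\infty$ is continuous and maximized at $\theta_0$. If $\theta_0 \in \bar{\Theta}_1$
(closure of $\Theta_1$), then $\sup l_\infty(\Theta_1) = \max l_\infty(\Theta)$.

Lemma 2. Suppose $\theta_0$ is a well-separated maximizer of $l_\infty$ in the sense that for every
$\epsilon > 0$,

$$
\sup_{\theta : \|\theta - \theta_0\| \ge \epsilon} l_\infty(\theta) < l_\infty(\theta_0).
$$

If $\theta_0 \notin \bar{\Theta}_1$, then $\sup l_\infty(\Theta_1) < \max l_\infty(\Theta)$.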

The preceding discussion leaves open the case that $\sup l_\infty(\Theta_1) = \sup l_\infty(\Theta_2)$, which
can happen if $\theta_0$ lies on the boundary between $\Theta_1$ and $\Theta_2$ or, more generally, if
$\theta_0 \in (\bar{\Theta}_1 \cap \bar{\Theta}_2)$. Take the first example in Section 3, comparing
$H_1 : \theta \le 0.2$ with $H_2 : \theta > 0.2$. If the true value of $\theta$ is 0.1 or 0.5, then the above
results show that $H_1$ or $H_2$, respectively, will eventually dominate the other hypothesis. However, if
$\theta = 0.2$, then the two hypotheses are tied with respect to $\sup l_\infty$ and a closer examination is
required. The following theorem characterizes the asymptotic distribution of the GLR
$\sup L_n(\Theta_1)/\sup L_n(\Theta_2)$ in terms of the limits of the sets
$A_{jn} = \sqrt{n}(\Theta_j - \theta_0)$, $j = 1, 2$. A sequence of sets $A_n$ converges to a set $A$ if $A$ is
the set of all limits $\lim a_n$ of convergent sequences $(a_n)$ with $a_n \in A_n$ for every $n$ and,
moreover, the limit $a = \lim_k a_{n_k}$ of every convergent sequence $(a_{n_k})$ with $a_{n_k} \in A_{n_k}$
for every $k$ is contained in $A$. Also define the distance between a vector $b$ and a set $A$ in the same
Euclidean space as $\|b - A\| = \inf_{a \in A} \|b - a\|$.

Theorem 3. Suppose the model $\{g(\cdot; \theta) : \theta \in \Theta\}$ is differentiable in quadratic mean at
$\theta_0$ in the sense of van der Vaart (1998, Section 7.2) with non-singular Fisher information matrix
$I_{\theta_0}$. Suppose there exists a measurable function $h$ such that $E_{\theta_0}\{h(Y)^2\} < \infty$ and
that, for every $\theta_1$ and $\theta_2$ in a neighborhood of $\theta_0$,

$$
|\log g(y; \theta_1) - \log g(y; \theta_2)| \le h(y) \|\theta_1 - \theta_2\|.
$$

Let $\Theta_1, \Theta_2$ be subsets of $\Theta$ for which the restricted maximum likelihood estimators
$\hat{\theta}_{jn} = \arg\max_{\Theta_j} L_n$, $j = 1, 2$, are consistent under $\theta_0$ and the sequence of
sets $A_{jn}$ (defined above) converges to some $A_j$, $j = 1, 2$. Then, under $\theta_0$, we have

$$
2\{\log \sup L_n(\Theta_1) - \log \sup L_n(\Theta_2)\} \xrightarrow{d}
\|I_{\theta_0}^{1/2} W - I_{\theta_0}^{1/2} A_2\|^2 - \|I_{\theta_0}^{1/2} W - I_{\theta_0}^{1/2} A_1\|^2
$$

for $W$ normally distributed with mean 0 and variance matrix $I_{\theta_0}^{-1}$.



This result parallels Theorem 16.7 of van der Vaart (1998) for likelihood ratio tests and can be proved
using similar arguments. Unlike Theorem 2, Theorem 3 does require correct specification of the model
$g(\cdot; \theta)$. The differentiability condition for the model $g(\cdot; \theta)$ is satisfied for most
models used in practice. The consistency of the restricted maximum likelihood estimators $\hat{\theta}_{jn}$
typically requires that $\theta_0 \in (\Theta_1 \cap \Theta_2)$. In the more general case
$\theta_0 \in (\bar{\Theta}_1 \cap \bar{\Theta}_2)$, one could replace each $\Theta_j$ with $\bar{\Theta}_j$
when applying the theorem and the conclusion would still hold for the original $\Theta_j$ as long as $L_n$ is
continuous. To understand the limiting distribution given in Theorem 3, note that $I_{\theta_0}^{1/2} W$ is a
standard normal vector of the same dimension as $\theta$. If $\theta_0$ is interior to $\Theta_j$, then $A_j$
is the entire Euclidean space and $\|I_{\theta_0}^{1/2} W - I_{\theta_0}^{1/2} A_j\| = 0$. The situation is
more complicated if $\theta_0$ lies on the boundary of a parameter set. Consider, again, the first example in
Section 3. When $\theta_0 = 0.2$ in this example, $A_1$ and $A_2$ are the negative and positive half-lines
respectively, and the GLR $\sup L_n(\Theta_1)/\sup L_n(\Theta_2)$ converges in distribution to
$\exp[\{I(Z < 0) - I(Z > 0)\} Z^2/2]$, where $Z$ is a standard normal variable and $I(\cdot)$ is the indicator
function. Thus $H_1$ and $H_2$ are virtually symmetric in the limit even though one is technically correct and
the other is not. As a more extreme example, suppose $\theta_0$ is an interior point of $\Theta$ and take
$\Theta_1 = \{\theta_0\}$ and $\Theta_2 = \Theta_1^c$. Then $A_1 = \{0\}$ and $A_2$ equals the entire Euclidean
space, so the limit in Theorem 3 is $-\chi^2_{\dim(\Theta)}$. This result adds to the discussion at the end of
Section 2 with an asymptotic approximation.
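
The limiting distribution in the boundary case can be compared with an exact finite-sample calculation. The
sketch below assumes the binomial model of the first example with true value $\theta_0 = 0.2$ and a
hypothetical sample size of $n = 1000$; it computes the exact probability that the GLR of $H_1$ over $H_2$
reaches a benchmark $k > 1$ and sets it beside the limiting value $P(Z \le -\sqrt{2 \log k})$ implied by the
expression above. The function names and the sample size are illustrative choices.

```python
import math
from statistics import NormalDist


def log_glr_h1_over_h2(x, n, cutoff):
    """Log GLR of H1: theta <= cutoff over H2: theta > cutoff for x successes in n trials."""
    def loglik(theta):
        # The convention 0 * log(0) = 0 covers x = 0 and x = n.
        a = x * math.log(theta) if x > 0 else 0.0
        b = (n - x) * math.log(1.0 - theta) if x < n else 0.0
        return a + b

    mle = x / n
    # The likelihood is unimodal, so each supremum sits at the MLE or at the cutoff.
    return loglik(min(mle, cutoff)) - loglik(max(mle, cutoff))


def binom_pmf(x, n, theta):
    """Binomial probability computed on the log scale to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
               + x * math.log(theta) + (n - x) * math.log(1.0 - theta))
    return math.exp(log_pmf)


def exact_prob_glr_at_least(k, n, theta0=0.2, cutoff=0.2):
    """Exact P(GLR >= k) when X ~ Binomial(n, theta0)."""
    return sum(binom_pmf(x, n, theta0)
               for x in range(n + 1)
               if log_glr_h1_over_h2(x, n, cutoff) >= math.log(k))


def limit_prob_glr_at_least(k):
    """P(exp[{I(Z<0) - I(Z>0)} Z^2 / 2] >= k) for k > 1 and standard normal Z."""
    return NormalDist().cdf(-math.sqrt(2.0 * math.log(k)))


for k in (8, 32):
    print(k, exact_prob_glr_at_least(k, n=1000), limit_prob_glr_at_least(k))
```

By symmetry of the limit, the same limiting probability applies to the GLR of $H_2$ over $H_1$, which
illustrates why the two hypotheses are described as virtually symmetric when $\theta_0$ lies on the boundary.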



5 Discussion of previous concerns 



Both Royall (1997) and Blume (2002) have emphasized that the LL is silent about composite hypotheses.
Neither author has expressed much optimism about expanding the likelihood paradigm to include composite
hypotheses, despite their practical relevance. The main concerns expressed in Royall (1997, Section 1.9) and
Blume (2002, Section 2.6) are outlined and addressed below.



5.1 Logical evidence versus statistical evidence 



Royall (1997) points out the distinction between logical evidence and statistical evidence, and argues that
the former should not be substituted for the latter. This can be illustrated with the following example from
Royall (1997, Section 1.7.2). Suppose that $\Theta_1 = \{\theta_1\}$ and $\Theta_2 = \{\theta_1, \theta_2\}$
for distinct parameter values $\theta_1$ and $\theta_2$. Because $\Theta_1 \subset \Theta_2$, the hypothesis
$H_2 : \theta \in \Theta_2$ certainly appears more plausible than $H_1 : \theta \in \Theta_1$. However, if
$L(\theta_1) = L(\theta_2)$ then the LL says that $H_1$ and $H_2$ are equally well supported by the data. This
is not surprising because the LL concerns the statistical evidence alone, irrespective of the logical
relationship between the two hypotheses.

On the other hand, it is reasonable to require that statistical evidence be interpreted in a logically
consistent manner. In the example of the above paragraph, it seems difficult to imagine a data configuration
that can be naturally interpreted as supporting $H_1$ over $H_2$, even if $L(\theta_1)$ and $L(\theta_2)$ are
allowed to differ in any possible way. A theory that allows such an illogical conclusion would be really
troubling. This is the rationale for imposing Axiom 2 in developing the GLL.

The GLL respects logical relationships among hypotheses without substituting logical evidence for
statistical evidence. This can be illustrated with the same example discussed in the above two paragraphs.
Under the GLL, there can never be strict support for $H_1$ over $H_2$, eliminating logical inconsistencies. On
the other hand, the GLL does not confuse the logical relationship $H_1 \Rightarrow H_2$ with statistical
evidence. It never lends evidential support to $H_2$ over $H_1$ without a solid statistical basis, because the
GLR of $H_2$ to $H_1$ cannot be greater than 1 unless $L(\theta_2) > L(\theta_1)$. This can also be seen in
another example from Royall (1997, Section 1.9) with $\Theta_1 = \{\theta_1, \theta_3\}$,
$\Theta_2 = \{\theta_2\}$ and $L(\theta_1) < L(\theta_2) < L(\theta_3)$. Based on the latter inequality,
$H_1$ is supported over $H_2$ under the GLL. This conclusion will not hold if the likelihood becomes such that
$L(\theta_1) < L(\theta_2) = L(\theta_3)$ even though the logical relationship between $H_1$ and $H_2$ is
unchanged. More generally, for any hypothesis $H_1 : \theta \in \Theta_1$ to be supported over another
hypothesis $H_2 : \theta \in \Theta_2$, the GLL requires the existence of $\theta_1 \in \Theta_1$ such that
$L(\theta_1) > \sup L(\Theta_2)$, a crucial piece of statistical evidence that cannot be
replaced by any logical evidence.
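
The two finite examples in this subsection can be checked mechanically. In the sketch below the likelihood
values are hypothetical numbers chosen only to satisfy the stated orderings, and `glr` is an illustrative
helper rather than anything defined in the cited sources.

```python
def glr(lik, theta_1, theta_2):
    """GLR of H1 (theta in theta_1) over H2 (theta in theta_2) on a finite parameter space.

    `lik` maps each parameter label to its likelihood value.
    """
    return max(lik[t] for t in theta_1) / max(lik[t] for t in theta_2)


# Nested case: Theta_1 = {t1} is contained in Theta_2 = {t1, t2}.
# Whatever the likelihood values, H1 is never strictly supported over H2.
nested = [
    glr(lik, {"t1"}, {"t1", "t2"})
    for lik in ({"t1": 0.3, "t2": 0.3}, {"t1": 0.3, "t2": 0.1}, {"t1": 0.1, "t2": 0.3})
]
print(nested)

# Non-nested case: Theta_1 = {t1, t3}, Theta_2 = {t2}.
# H1 is supported when L(t1) < L(t2) < L(t3), but not when L(t2) = L(t3).
supported = glr({"t1": 0.1, "t2": 0.2, "t3": 0.4}, {"t1", "t3"}, {"t2"})
tied = glr({"t1": 0.1, "t2": 0.2, "t3": 0.2}, {"t1", "t3"}, {"t2"})
print(supported, tied)
```

In the nested case the ratio never exceeds 1, and it drops below 1 only when $L(\theta_2) > L(\theta_1)$; in
the second case the support for $H_1$ disappears once the likelihood ordering changes, even though the sets
themselves are unchanged.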

5.2 The role of a prior distribution 

If a prior distribution can be specified for $\theta$, then the probabilities $P(X = x \mid H_1)$ and
$P(X = x \mid H_2)$ can be evaluated and the LL can be used to assess the evidence about $H_1$ versus $H_2$.
As Royall (1997) and Blume (2002) point out, this Bayesian approach reduces composite hypotheses into simple
ones by modifying the probability model. The choice of a prior distribution can be arbitrary, and an LR that
involves a prior distribution may not provide an objective representation of the observed evidence.

Given his insight into the Bayesian approach, it is interesting that Royall supports his reservation about
considering composite hypotheses in the likelihood paradigm with the following Bayesian observation. He
observes that the posterior probability ratio $P(H_1 \mid X = x)/P(H_2 \mid X = x)$ can be larger or smaller
than the prior probability ratio $P(H_1)/P(H_2)$, depending on the prior distribution, unless one hypothesis
dominates the other as in Axiom 1 (Royall, 1997, Section 1.9). If anything, this appears to highlight the
subjectivity
inherent in the Bayesian approach. In no way does the above observation suggest that statistical evidence 
about composite hypotheses cannot be interpreted objectively. 
5.3 The lack of a unique solution

Blume's main concern about composite hypotheses appears to be that there is no unique way to deal
with them (Blume, 2002, Section 2.6). This, of course, is the case with many problems, including the
very problem the LL aims to address. (Statisticians and scientists who have not yet accepted the LL
wholeheartedly need not regard it as the unique approach to statistical evidence.) Within the likelihood
paradigm, there are different ways to obtain a likelihood function for the parameter of interest in the
presence of nuisance parameters (a special type of composite hypotheses). Among other possibilities, the
profile likelihood approach has been shown to have desirable statistical properties and appears to be a viable
general approach to dealing with nuisance parameters (Royall, 1997, 2000; Blume, 2002). The GLL provides
a natural extension of the profile likelihood approach to more general composite hypotheses.

It is actually desirable to have a set of possible solutions to choose from. Besides maximizing the likelihood
over each hypothesis as in this paper, Blume (2002) also mentions other possible approaches to composite
hypotheses (to illustrate the lack of a unique solution). One of the alternatives mentioned is to minimize
the likelihood over each hypothesis, which clearly makes no sense. Suppose, for example, that
$L(\theta_1) > 0$ for some $\theta_1$ and that $\inf L(\Theta) = 0$, which is not uncommon. The minimum
likelihood rule would then indicate (infinitely) strong evidence supporting $H_1 : \theta = \theta_1$ over
$H_2 : \theta \in \Theta$, even though the latter hypothesis is trivially true. Blume also mentions the
possibility of averaging the likelihood over a composite hypothesis with a weight function, which is
essentially the Bayesian approach. As noted earlier, the Bayesian approach relies on external information and
may not provide an objective representation of the observed evidence. In contrast, the GLL provides an
objective representation of evidence and avoids illogical and counterintuitive conclusions.



6 Comparison with other procedures 
6.1 Connection with likelihood ratio tests 

As we have seen in the examples discussed so far, $\Theta_1$ and $\Theta_2$ need not be disjoint or exhaustive
for the GLL to apply. There are, however, many applications where it is customary to take
$\Theta_2 = \Theta_1^c$. These problems are usually tackled with statistical tests, such as likelihood ratio
tests (LRTs). To be specific, let $H_1 : \theta \in \Theta_1$ be considered the null hypothesis and
$H_2 : \theta \notin \Theta_1$ the alternative; this suggests that evidence supporting $H_2$ over $H_1$ is of
particular interest. The LRT is based on the statistic

$$
\frac{\sup L(\Theta)}{\sup L(\Theta_1)}
= \frac{\sup L(\Theta_1) \vee \sup L(\Theta_2)}{\sup L(\Theta_1)}
= 1 \vee \frac{\sup L(\Theta_2)}{\sup L(\Theta_1)},
$$

where $\vee$ denotes maximum. The null hypothesis will formally be rejected if the above test statistic is
greater than some critical value, which is typically greater than 1. Thus, for the purpose of seeking evidence
for $H_2$, the LRT statistic is essentially equivalent to the GLR $\sup L(\Theta_2)/\sup L(\Theta_1)$ given by
the GLL.
However, the same GLR can be interpreted in different ways. In the likelihood paradigm, the GLR is
all that is needed to compare the two hypotheses. It determines both the nature and the strength of the 
evidence, and can be placed on a universal scale together with GLRs from different problems. The strength 
of statistical evidence is measured on a continuum and should be understood as such, even though descriptive 
benchmarks are sometimes used to facilitate communication. In an LRT, the test statistic is to be compared 
with a critical value before a dichotomous conclusion (whether to reject $H_1$ in favor of $H_2$) can be
reached. The critical value is derived from the (asymptotic) null distributions of the test statistic and
usually depends on the significance level and certain properties of $\Theta_1$ (van der Vaart, 1998,
Chapter 16). The use of a p-value provides some continuity and eliminates the dependence on the significance
level, but a p-value still depends on other features of the problem that are irrelevant from an evidential
point of view. In general, Royall (1997) argues that hypothesis tests are not appropriate tools for
interpreting data as evidence.



6.2 Confidence sets versus support sets 

Denote by $c_\alpha(\theta_1)$ the LRT critical value for testing $H_1 : \theta = \theta_1$ against
$H_2 : \theta \ne \theta_1$ at significance level $\alpha$. The associated $1 - \alpha$ confidence set for
$\theta$ is given by $\{\theta \in \Theta : L(\theta) > \sup L(\Theta)/c_\alpha(\theta)\}$. If
$c_\alpha(\theta) = c_\alpha$, as is often the case, then the confidence set is simply
$\{\theta : L(\theta) > \sup L(\Theta)/c_\alpha\}$. This, interestingly, coincides with a likelihood support
set. Royall (1997) and Blume (2002) have discussed likelihood support sets in one-dimensional situations, where
the support sets are typically intervals. In general, the $1/k$ support set for $\theta$ can be defined as

$$
S_k = \{\theta : L(\theta) > \sup L(\Theta)/k\}, \quad k > 1. \tag{1}
$$

Despite a similar appearance, a support set is to be interpreted differently than a confidence set. The usual 
interpretation of confidence sets in terms of long-run coverage does not fit well into the likelihood paradigm, 
where the emphasis is placed on understanding the observed data (as opposed to fictitious repetitions of 
the same experiment). The LL leads to the following interpretation of support sets. The values in the $1/k$
support set $S_k$ are "consistent with the observations" in the sense that no other value can be better
supported by a factor greater than $k$ (Royall, 1997, Section 1.12). An alternative, perhaps more
straightforward interpretation is made available by the GLL. Recall from Section 2 that evidence about a single
hypothesis could be evaluated using its complement as the default comparator. In this sense, $S_k$ is simply
the smallest parameter set supported by a factor of $k$.

Theorem 4. If $\sup L(S)/\sup L(S^c) \ge k$ for some $S \subset \Theta$, then $S_k \subset S$.

Proof. By assumption, $\sup L(S^c) \le \sup L(S)/k \le \sup L(\Theta)/k$. It follows that

$$
S^c \subset \{\theta : L(\theta) \le \sup L(\Theta)/k\} = S_k^c,
$$

from which the result is immediate. □

6.3 Drawing support from support sets



Choi et al. (2008) propose to use profile likelihood support intervals to assess bioequivalence. As discussed
in Section 3, the parameter of interest in this context is usually a scalar parameter $\gamma \in \Gamma$ that summarizes
the difference in bioavailability between a test drug or formulation and a reference. Suppose bioequivalence
is defined as $\gamma \in (\gamma_L, \gamma_U) = \Gamma_{BE}$ for specified bounds $\gamma_L$ and $\gamma_U$. Write $\theta = (\gamma, \omega)$ for the entire parameter
vector that determines the distribution of the data, with $\omega \in \Omega$ considered the nuisance parameter. Choi et
al. work with the profile likelihood $\tilde{L}(\gamma) = \sup_{\omega} L(\gamma, \omega)$, from which support intervals for $\gamma$ can be derived
as

$$\tilde{S}_k = \{\gamma \in \Gamma : \tilde{L}(\gamma) > \sup \tilde{L}(\Gamma)/k\}, \quad k > 1.$$

Note that $\sup \tilde{L}(\Gamma) = \sup L(\Theta)$ and that $\tilde{S}_k = \{\gamma : (\gamma, \omega) \in S_k \text{ for some } \omega\}$ with $S_k$ defined by (1). Choi et
al. suggest that if $\tilde{S}_k \subset \Gamma_{BE}$ then there is evidence supporting the bioequivalence hypothesis; the larger $k$ is,
the stronger the evidence. They also suggest what is essentially a sensitivity analysis using several support
intervals (say with $k = 5, 8, 32$). Implicit in the latter suggestion is an attempt to quantify the strength of
the evidence with the "largest" $k$ for which the bioequivalence hypothesis is supported.
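
The support-interval check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are hypothetical within-subject differences in log-AUC, modeled as $N(\gamma, \sigma^2)$ with $\sigma$ as the nuisance parameter, and the bounds $\pm\log(1.25)$ are assumed for $\Gamma_{BE}$. None of these choices is taken from Choi et al. (2008) or from the examples in Section 3.

```python
import math

def relative_profile_likelihood(gamma, diffs):
    """Profile likelihood of the mean gamma in a N(gamma, sigma^2) model,
    maximized over the nuisance sigma and scaled so that its maximum is 1."""
    n = len(diffs)
    mean = sum(diffs) / n
    ss_hat = sum((d - mean) ** 2 for d in diffs)      # residual SS at the MLE
    ss_gamma = sum((d - gamma) ** 2 for d in diffs)   # residual SS at gamma
    return (ss_hat / ss_gamma) ** (n / 2)

def profile_support_interval(diffs, k, grid):
    """End points of the grid approximation to the 1/k profile support interval."""
    inside = [g for g in grid if relative_profile_likelihood(g, diffs) > 1.0 / k]
    return min(inside), max(inside)

# Hypothetical log-AUC differences (test minus reference).
diffs = [0.05, -0.02, 0.10, 0.03, -0.04, 0.08, 0.01, 0.06, -0.01, 0.04, 0.07, 0.00]
lower_be, upper_be = math.log(0.8), math.log(1.25)    # assumed bounds for Gamma_BE
grid = [i / 1000 for i in range(-500, 501)]

intervals = {k: profile_support_interval(diffs, k, grid) for k in (5, 8, 32)}
for k, (lo, hi) in intervals.items():
    supported = lower_be < lo and hi < upper_be
    print(f"k = {k:2d}: [{lo:.3f}, {hi:.3f}]  inside BE region: {supported}")
```

Repeating the check over several values of $k$ corresponds to the sensitivity analysis mentioned above.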

The GLL provides a general theoretical basis for the above suggestions. To see this, note first that
$\tilde{S}_k \subset \Gamma_{BE}$ if and only if $S_k \subset \Gamma_{BE} \times \Omega =: \Theta_{BE}$. Thus Choi et al.'s approach can be cast in terms of the
parameter $\theta$ and the associated support sets $S_k$. The following theorem justifies their usage of support sets
for an arbitrary hypothesis concerning $\theta$.

Theorem 5. Let $\Theta_A \subset \Theta$ and write $r_A = \sup L(\Theta_A) / \sup L(\Theta_A^c)$. (a) If $S_k \subset \Theta_A$ for some $k > 1$, then
$r_A \ge k$. (b) Denote $k^* = \sup\{k > 1 : S_k \subset \Theta_A\}$, which will be set to 1 if there is no qualifying $k$. Then
$r_A > 1$ if and only if $k^* > 1$, in which case $r_A = k^*$.



Proof. If $S_k \subset \Theta_A$, then

$$r_A = \frac{\sup L(\Theta_A)}{\sup L(\Theta_A^c)} \ge \frac{\sup L(S_k)}{\sup L(S_k^c)} \ge k,$$

proving statement (a). The "if" part of statement (b) follows directly from statement (a), which, by a
limiting argument, further implies that $r_A \ge k^*$. To prove the "only if" part, we can invoke Theorem 4
with $S = \Theta_A$ and $k = r_A$, which also shows that $r_A \le k^*$. The proof is complete upon combining the two
arguments. □
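
Theorem 5(b) can be checked numerically. The sketch below is an illustration only: it assumes a binomial likelihood with hypothetical data (3 successes in 20 trials), a hypothetical hypothesis $\Theta_A = (0.05, 0.40)$, and a grid approximation to $\Theta = (0, 1)$. It computes $r_A$ directly and finds $k^*$ by bisection over $k$.

```python
def likelihood(theta, successes=3, trials=20):
    """Binomial likelihood kernel for the hypothetical data."""
    return theta ** successes * (1.0 - theta) ** (trials - successes)

grid = [i / 1000 for i in range(1, 1000)]
in_hypothesis = lambda theta: 0.05 < theta < 0.40    # hypothetical Theta_A

sup_all = max(likelihood(t) for t in grid)
sup_a = max(likelihood(t) for t in grid if in_hypothesis(t))
sup_ac = max(likelihood(t) for t in grid if not in_hypothesis(t))
r_a = sup_a / sup_ac                                  # GLR of Theta_A to its complement

def support_set_inside(k):
    """True if S_k = {theta : L(theta) > sup L / k} lies within Theta_A."""
    return all(in_hypothesis(t) for t in grid if likelihood(t) > sup_all / k)

# Bisection for k* = sup{k > 1 : S_k within Theta_A}; S_k grows with k,
# so the inclusion holds for small k and fails for large k.
low, high = 1.0, 1e6
for _ in range(60):
    mid = (low + high) / 2
    if support_set_inside(mid):
        low = mid
    else:
        high = mid
k_star = low

print(f"r_A = {r_a:.4f}, k* = {k_star:.4f}")
```

In this example the maximum likelihood estimate lies in $\Theta_A$, so $r_A > 1$ and the two printed quantities agree up to numerical error.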



7 Hypothesis tests as reduced data 

When reporting their research findings, scientists do not always provide the raw data and sometimes only 
present the final results of statistical tests concerning their research hypotheses. Without access to the raw 
data, an interested reader may not be able to produce the likelihood function for the parameter of interest 
and is often forced to work with the results of hypothesis tests. Such difficulties do not necessarily force us 
out of the likelihood paradigm, as the GLL can still be used to interpret hypothesis tests as reduced data.
This will be illustrated below for both dichotomous test results and p-values. 



7.1 Dichotomous test results 

Suppose the null hypothesis $H_1$ is to be tested against the alternative $H_2$ at significance level $\alpha$. Let $T = 1$
if $H_1$ is rejected in favor of $H_2$; $T = 0$ otherwise. The GLR of $H_2$ to $H_1$ based on the result $T = t$ is given by

$$r_T(t) = \sup_{H_2} P(T = t) / \sup_{H_1} P(T = t), \quad t = 0, 1.$$



In the case of simple hypotheses, this has been discussed by Royall (1997, pp. 48-49). For an arbitrary $H_1$,
$\sup_{H_1} P(T = 1)$ is the size of the test, which must not exceed $\alpha$ for a level-$\alpha$ test. In fact, except in very
discrete cases, a reasonable test should have size close to $\alpha$. On the other hand, $\sup_{H_2} P(T = 1)$ is the
maximum power of the test, which is generally greater than $\alpha$ and frequently equal to 1. Thus it seems
reasonable to expect $r_T(1) > 1$, which justifies the usual interpretation of a rejection as evidence for the
alternative. The GLR based on a non-rejection appears less predictable, consistent with the conventional 
wisdom that a failure to reject the null does not necessarily support the null. It should be noted that while 
the GLL appears consistent with conventional interpretations of hypothesis tests, it does not justify the use 
of such tests to reduce the data. The GLL should ideally be applied to the original data. 

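More can be said about the GLR $r_T$ for some typical hypotheses concerning a scalar $\theta \in (\underline{\theta}, \bar{\theta})$. Consider
for instance the one-sided hypotheses $H_1 : \theta \le \theta^*$ versus $H_2 : \theta > \theta^*$. In terms of the power function
$\pi(\theta) = P_\theta(T = 1)$, the GLR can now be written as

$$r_T(t) = \begin{cases} \{1 - \inf_{\theta > \theta^*} \pi(\theta)\} / \{1 - \inf_{\theta \le \theta^*} \pi(\theta)\}, & t = 0; \\ \sup_{\theta > \theta^*} \pi(\theta) / \sup_{\theta \le \theta^*} \pi(\theta), & t = 1. \end{cases}$$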

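In many examples, $\pi(\theta)$ is increasing in $\theta$ with $\pi(\underline{\theta}+) = 0$, $\pi(\theta^*) = \alpha$ and $\pi(\bar{\theta}-) = 1$. If this is the case, then

$$r_T(t) = \begin{cases} 1 - \alpha, & t = 0; \\ 1/\alpha, & t = 1. \end{cases}$$
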
Thus, if $H_1$ is rejected, the resulting evidence supporting $H_2$ over $H_1$ is moderate ($r_T = 20$) for $\alpha = 0.05$ and
strong ($r_T = 40$) for $\alpha = 0.025$. If $H_1$ is not rejected, then $H_1$ is supported over $H_2$ but the evidence is very
weak for common values of $\alpha$. Sometimes the one-sided hypotheses are formulated as $H_1' : \theta = \theta^*$ versus
$H_2 : \theta > \theta^*$. Such a reformulation does not usually require a different test. Assuming the same properties
of $\pi$ as stated above, we then have

$$r_T(t) = \begin{cases} \{1 - \inf_{\theta > \theta^*} \pi(\theta)\} / \{1 - \pi(\theta^*)\} = 1, & t = 0; \\ \sup_{\theta > \theta^*} \pi(\theta) / \pi(\theta^*) = 1/\alpha, & t = 1. \end{cases}$$

The reformulated null hypothesis cannot be supported even if it is not rejected. 
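
The one-sided formulas above can be evaluated numerically for a concrete test. The following Python sketch is an illustration under assumptions not made in the text: a one-sided z-test with $\theta^* = 0$, sample size $n = 25$, known $\sigma = 1$, and a finite grid standing in for the parameter space. It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(theta, alpha, n=25, sigma=1.0):
    """Power of the one-sided z-test of H1: theta <= 0 at level alpha."""
    critical = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(critical - n ** 0.5 * theta / sigma)

def glr_dichotomous(alpha, grid):
    """Grid approximation to (r_T(0), r_T(1)) for H1: theta <= 0 vs H2: theta > 0."""
    null = [power(t, alpha) for t in grid if t <= 0]
    alt = [power(t, alpha) for t in grid if t > 0]
    r0 = (1.0 - min(alt)) / (1.0 - min(null))   # based on a non-rejection
    r1 = max(alt) / max(null)                   # based on a rejection
    return r0, r1

grid = [i / 1000 for i in range(-5000, 5001)]
for alpha in (0.05, 0.025):
    r0, r1 = glr_dichotomous(alpha, grid)
    print(f"alpha = {alpha}: r_T(0) = {r0:.3f} (1 - alpha = {1 - alpha:.3f}), "
          f"r_T(1) = {r1:.2f} (1/alpha = {1 / alpha:.2f})")
```

The grid values should be close to $1 - \alpha$ and $1/\alpha$; small discrepancies in $r_T(0)$ arise because the infimum over $\theta > \theta^*$ is only approached, not attained, on a finite grid.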






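Suppose the simple null $H_1 : \theta = \theta^*$ is to be tested against the two-sided alternative $H_2 : \theta \ne \theta^*$.
Assume the power function $\pi(\theta)$ equals $\alpha$ at $\theta = \theta^*$, increases as $\theta$ moves away from $\theta^*$, and tends to 1 as $\theta$
approaches $\underline{\theta}$ or $\bar{\theta}$. Then

$$r_T(t) = \begin{cases} 1, & t = 0; \\ 1/\alpha, & t = 1. \end{cases}$$

Again, the simple null is never supported.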

Now let us consider an equivalence testing problem with hypotheses $H_1 : |\theta - \theta^*| \ge \delta$ and $H_2 : |\theta - \theta^*| < \delta$
for some $\delta > 0$. Here it may be reasonable to assume that $\pi(\theta)$ attains its maximum $\pi_{\max}$ at $\theta = \theta^*$, decreases
as $\theta$ moves away from $\theta^*$, equals $\alpha$ at $\theta = \theta^* \pm \delta$, and tends to 0 as $\theta$ approaches $\underline{\theta}$ or $\bar{\theta}$. The maximum
power $\pi_{\max}$ is usually less than 1 but greater than $\alpha$. In this case the GLR of $H_2$ to $H_1$ is given by

$$r_T(t) = \begin{cases} 1 - \alpha, & t = 0; \\ \pi_{\max}/\alpha, & t = 1. \end{cases}$$

This provides a realistic example where the maximum power of the test is relevant even after rejecting the
null.
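
To see how the maximum power enters, the expression for the equivalence test can be tabulated for a few values of $\pi_{\max}$. The function name and the values of $\pi_{\max}$ below are hypothetical and chosen only for illustration.

```python
def glr_equivalence(alpha, pi_max):
    """GLR of H2 (equivalence) to H1 for a level-alpha equivalence test,
    assuming the power function has the shape described in the text."""
    if not 0.0 < alpha < pi_max <= 1.0:
        raise ValueError("expected 0 < alpha < pi_max <= 1")
    return {0: 1.0 - alpha, 1: pi_max / alpha}

alpha = 0.05
for pi_max in (0.9, 0.5, 0.2):   # hypothetical maximum powers
    glr = glr_equivalence(alpha, pi_max)
    print(f"pi_max = {pi_max}: r_T(0) = {glr[0]:.2f}, r_T(1) = {glr[1]:.1f}")
```

With $\alpha$ fixed, the evidence carried by a rejection is proportional to $\pi_{\max}$, so the same rejection is weaker evidence of equivalence when it comes from a low-powered test.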

7.2 p-values

Let $U$ denote a p-value for testing the null hypothesis $H_1$ against the alternative $H_2$. Depending on the nature
of $X$ and the procedure, $U$ may be discrete or continuous. In either case we write $f_U$ for the probability
density of $U$ with respect to an appropriate measure. The GLR of $H_2$ to $H_1$ based on the result $U = u$ is

$$r_U(u) = \sup_{H_2} f_U(u) / \sup_{H_1} f_U(u), \quad 0 < u < 1.$$

In practice, $U$ is often determined by a test statistic $V$ through a smooth monotone function, such as 1 minus
a reference distribution function, in which case $r_U$ is equivalent to the analogous GLR based on $V$ (denoted
by $r_V$).

For example, suppose $X = (Y_1, \ldots, Y_n)$ is a random sample from $N(\mu, \sigma^2)$ with $\sigma^2$ known, and consider
the hypotheses $H_1 : \mu \le 0$ versus $H_2 : \mu > 0$. In this situation it is common to use the p-value $U = 1 - \Phi(V)$,
where $\Phi$ is the standard normal distribution function and $V = n^{-1/2} \sum_{i=1}^{n} Y_i / \sigma$. The corresponding GLR
can be obtained as

$$r_U(u) = r_V(\Phi^{-1}(1 - u)) = \frac{\sup_{\mu > 0} \phi(\Phi^{-1}(1 - u) - \sqrt{n}\mu/\sigma)}{\sup_{\mu \le 0} \phi(\Phi^{-1}(1 - u) - \sqrt{n}\mu/\sigma)} = \begin{cases} \exp(\{\Phi^{-1}(1 - u)\}^2/2), & u < 0.5; \\ \exp(-\{\Phi^{-1}(1 - u)\}^2/2), & u \ge 0.5, \end{cases} \tag{2}$$



where $\phi$ denotes the standard normal density function. Note that $H_2$ is supported over $H_1$ if and only if
$U < 0.5$. In fact, since $V$ is sufficient for $\mu$ in this example, the above GLR is the same as that based on the
full data $X$. A less trivial example would be a two-sample problem with a random sample $(Y_{11}, \ldots, Y_{1n_1})$
from $N(\mu_1, \sigma^2)$ and another sample $(Y_{21}, \ldots, Y_{2n_2})$ from $N(\mu_2, \sigma^2)$. Assume $\sigma^2$ is known and consider the
hypotheses $H_1 : \mu_2 \le \mu_1$ versus $H_2 : \mu_2 > \mu_1$. Then the test statistic $V = (n_1^{-1} + n_2^{-1})^{-1/2} (\bar{Y}_2 - \bar{Y}_1)/\sigma$,
with $\bar{Y}_i = n_i^{-1} \sum_{j=1}^{n_i} Y_{ij}$ ($i = 1, 2$), would not be sufficient. Nonetheless, the GLR of $H_2$ to $H_1$ based on the
p-value $U = 1 - \Phi(V)$ continues to follow expression (2).
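
Expression (2) is straightforward to evaluate for a reported p-value. The following Python sketch assumes the one-sided z-test settings described above; the function name is hypothetical, and `statistics.NormalDist` from the Python standard library supplies $\Phi^{-1}$.

```python
import math
from statistics import NormalDist

def glr_from_p_value(u):
    """GLR of H2 to H1 given a one-sided z-test p-value u, per expression (2)."""
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - u)      # Phi^{-1}(1 - u)
    exponent = z * z / 2.0
    return math.exp(exponent if u < 0.5 else -exponent)

for u in (0.5, 0.1, 0.05, 0.025, 0.01, 0.001):
    print(f"u = {u:<6} GLR = {glr_from_p_value(u):.2f}")
```

Under these assumptions a p-value of 0.05 corresponds to a GLR slightly below 4, which is considerably smaller than the value $1/\alpha = 20$ obtained in Section 7.1 from a dichotomous rejection at level $\alpha = 0.05$.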

When interpreting a p-value, one might be tempted to invoke the characterization of $U$ as the "lowest"
significance level at which $H_1$ is rejected, that is, $U = \inf\{\alpha : T_\alpha = 1\}$, where $T$ is subscripted to emphasize
its dependence on $\alpha$. The observation $U = u$ implies that the use of any $\alpha > u$ would lead to a rejection
and hence a GLR $r_{T_\alpha}(1)$, and it might seem natural to use the quantity $r^+(u) = \sup_{\alpha > u} r_{T_\alpha}(1)$ to represent
the evidence in the observed p-value. While this quantity may be easy to compute when a simple expression
for $r_{T_\alpha}(1)$ is available as in Section 7.1, its interpretation can be problematic. First, a symmetric argument
based on the dual identity $U = \sup\{\alpha : T_\alpha = 0\}$ would lead to the dual quantity $r^-(u) = \inf_{\alpha < u} r_{T_\alpha}(0)$,
which is generally different and typically smaller than $r^+(u)$. It is not clear how to reconcile the difference.
More importantly, the GLRs $r_{T_\alpha}(t)$ are defined in Section 7.1 for a fixed $\alpha$. If $\alpha$ is allowed to depend on $U$,
a random variable, then $T_\alpha$ will become a different statistic to which the discussion of Section 7.1 no longer
applies.
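
The discrepancy between the two quantities can be made concrete in the one-sided setting of Section 7.1, where $r_{T_\alpha}(1) = 1/\alpha$ and $r_{T_\alpha}(0) = 1 - \alpha$, so that $r^+(u) = 1/u$ and $r^-(u) = 1 - u$. The Python sketch below tabulates both next to the GLR from expression (2); it assumes the one-sided z-test settings of that expression, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def r_plus(u):
    """sup over alpha > u of r_{T_alpha}(1) = 1/alpha (approached as alpha -> u)."""
    return 1.0 / u

def r_minus(u):
    """inf over alpha < u of r_{T_alpha}(0) = 1 - alpha (approached as alpha -> u)."""
    return 1.0 - u

def r_u(u):
    """GLR based on the p-value itself, per expression (2)."""
    z = NormalDist().inv_cdf(1.0 - u)
    return math.exp(z * z / 2.0 if u < 0.5 else -z * z / 2.0)

for u in (0.2, 0.05, 0.01):
    print(f"u = {u}: r+ = {r_plus(u):.1f}, r- = {r_minus(u):.2f}, r_U = {r_u(u):.2f}")
```

For these values of $u$, the quantities $r^+(u)$ and $r^-(u)$ lie on opposite sides of 1 and neither coincides with $r_U(u)$.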
Figure 1: Likelihood function for the success rate $\theta$ in the first example of Section 3, with the dashed line
separating the hypotheses $H_1 : \theta \le 0.2$ and $H_2 : \theta > 0.2$. (Axis label: Success Rate.)

Figure 2: Profile likelihood function for the difference in response rate (chemotherapy − radiation therapy)
in the second example of Section 3, with the dashed lines separating regions of inferiority, non-inferiority
and superiority.

Figure 3: Profile likelihood function for the difference in mean log-AUC (test − reference) in the third example
of Section 3, with the dashed lines separating regions of bioequivalence (BE) and non-BE. (Axis label: Difference in Mean.)

Figure 4: Profile likelihood function for the ratio of log-AUC standard deviations (test/reference) in the
third example of Section 3, with the dashed lines separating regions of bioequivalence (BE) and non-BE. (Axis label: Ratio of Standard Deviations.)